New York is a state located in the northeastern United States. New York was one of the original thirteen colonies that formed the United States. With more than 19 million residents in 2019,[6] it is the fourth most populous state. In order to distinguish the state from its city of the same name, which is located within the state, it is sometimes referred to as New York State.
https://en.wikipedia.org/wiki/New_York_(state)
There are 62 counties in the state of New York. The most recent county formation in New York was in 1914, when Bronx County was created from the portions of New York City that had been annexed from Westchester County in the late 19th century and added to New York County. https://en.wikipedia.org/wiki/List_of_counties_in_New_York
As a newcomer to the finance industry, I think cash is a has-been, but I still want to deploy new ATMs inside New York State.
I'm looking for places where people are likely to withdraw money AND where people don't yet have any other good option to do so.
Using data science, I want to analyze ATM data and understand where the needs are. I want to understand how my competitors think about it. Given that, I want to find where to deploy my money in order to bring the most value out of it.
This work will be driven by data, more than any prior knowledge of surrounding context.
As a European who has only seen New York for a few days, I don't know much about how things are organized in the US.
I discovered the "state, county, city" organization during this course and know nothing about the finance industry.
Publicly available data will drive my analysis, but I'm very aware that someone with a closer relationship to New York and/or finance would have different questions, understanding, and expectations.
For the exercise at hand, understanding the context is king; all the machine learning in the world will not compensate for that. But yes, this is an exercise.
import pandas as pd
import numpy as np
import folium
from geopy.geocoders import Nominatim
import seaborn as sns
from sklearn.cluster import KMeans
MAP_CENTER=[41, -73.125]
NY_COUNTIES_GEO="https://raw.githubusercontent.com/codeforamerica/click_that_hood/master/public/data/new-york-counties.geojson"
map_newyork = folium.Map(location = MAP_CENTER, zoom_start = 6)
map_newyork.choropleth(geo_data=NY_COUNTIES_GEO,
                       fill_opacity=0.1,
                       line_opacity=1,
                       legend_name="New York counties")
map_newyork
df=pd.read_csv("Bank-Owned_ATM_Locations_in_New_York_State.csv")
df.head()
# the third line of "Location 1" holds "(lat, lon)": strip the parentheses and split on the comma
latlon = df["Location 1"].str.split("\n", n=3, expand=True)[2] \
    .str.replace("(", "", regex=False).str.replace(")", "", regex=False) \
    .str.split(",", n=2, expand=True)
df["lat"] = pd.to_numeric(latlon[0], errors="coerce")  # numeric, so folium gets floats; bad values become NaN
df["lon"] = pd.to_numeric(latlon[1], errors="coerce")
df.head()
print(df.shape)
df.dropna(subset = ["lat","lon"]).shape
df.dropna(subset = ["lat","lon"],inplace=True)
We have extracted latitude and longitude into new columns and dropped the rows without coordinates.
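As a sanity check, here is a minimal sketch of the same parsing logic applied to one hypothetical `Location 1` value (the address is made up, but it mimics the CSV's three-line layout ending in `(lat, lon)`):

```python
import pandas as pd

# hypothetical "Location 1" value: street, city/state/zip, then "(lat, lon)"
raw = pd.Series(["5518 8th Avenue\nBrooklyn, NY 11220\n(40.634, -74.006)"])

latlon = (raw.str.split("\n", n=3, expand=True)[2]
             .str.replace("(", "", regex=False)
             .str.replace(")", "", regex=False)
             .str.split(",", n=2, expand=True))

# cast to float so downstream plotting receives numbers, not strings
lat = latlon[0].astype(float)
lon = latlon[1].astype(float)
print(lat[0], lon[0])  # → 40.634 -74.006
```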
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode("5518 8th Avenue, Brooklyn, NY 11220")
location
map_newyork = folium.Map(location=MAP_CENTER, zoom_start=9)
for lat, lng, name, addr in zip(df['lat'], df['lon'], df['Name of Institution'], df['County']):
    if pd.notna(lat):
        label = folium.Popup('{}, {}'.format(name, addr), parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=1,
            popup=label,
            color='blue',
            fill=False,
            fill_color='#3186cc',
            fill_opacity=0.4).add_to(map_newyork)
map_newyork
Exploring this point cloud, we can confirm that the ATM locations all fall within New York State.
Some regions seem to have a much higher density of ATMs than others.
#County names used in the geojson are not the same as the ones in the CSV,
#so we need to adapt our dataset:
#we create the column "County_idx" to be able to join the geodata
NY_COUNTIES_GEO_KEY="feature.properties.name"
df["County_idx"]=df["County"].map(str)+" County"
df["County_idx"]=df["County_idx"].replace('Montogomery County','Montgomery County')
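One way to catch remaining mismatches like `Montogomery` up front is to diff the CSV names against the geojson names. The sketch below uses two small hypothetical name sets; the same set difference works on the real feature names loaded from `NY_COUNTIES_GEO`:

```python
# hypothetical name sets; in practice the geojson side comes from NY_COUNTIES_GEO
geojson_names = {"Albany County", "Bronx County", "Montgomery County"}
csv_names = {"Albany County", "Bronx County", "Montogomery County"}  # note the typo

# counties in the CSV that the choropleth would silently fail to color
missing = csv_names - geojson_names
print(missing)  # → {'Montogomery County'}
```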
count_by_county = df["County_idx"].value_counts().reset_index()
count_by_county.head()
map_newyork = folium.Map(location=MAP_CENTER, zoom_start=6)
map_newyork.choropleth(geo_data=NY_COUNTIES_GEO,
                       key_on=NY_COUNTIES_GEO_KEY,
                       data=count_by_county, columns=["index", "County_idx"],
                       fill_opacity=0.9,
                       line_opacity=1,
                       legend_name="ATM density per county")
# add markers to map
for lat, lng, name, addr in zip(df['lat'], df['lon'], df['Name of Institution'], df['County_idx']):
    if pd.notna(lat):
        label = folium.Popup('{}, {}'.format(name, addr), parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=2,
            popup=label,
            color='red',
            fill=True,
            fill_color='#3186cc',
            opacity=0.1).add_to(map_newyork)
map_newyork
Manhattan gets its own class, with an unusually high number of ATMs (deep blue).
The area of a county is probably not relevant for explaining ATM locations; many large counties have a low number of ATMs.
pop = pd.read_html("https://www.newyork-demographics.com/counties_by_population",
                   header=0,
                   attrs={"class": "ranklist"})[0]
pop.drop("Rank", axis=1, inplace=True)
pop.drop(62, axis=0, inplace=True)  # drop the trailing non-county row (the 62 counties sit in rows 0-61)
pop["Population"] = pd.to_numeric(pop["Population"], errors='raise')
pop.head()
df.shape
df.merge(pop, left_on='County_idx', right_on='County').shape
atm_count=df["County_idx"].value_counts().reset_index()
atm_count.rename(columns={'County_idx': 'ATM_count',"index":"County"},inplace=True)
atm_count
dd=atm_count.merge(pop, left_on='County', right_on='County')
dd
sns.distplot(dd["ATM_count"]).set(xlim=(0));
sns.distplot(dd["Population"]).set(xlim=(0));
It seems pretty clear from those charts that both features have a very similar distribution.
dd["ATM_per_hab"]=dd["ATM_count"]/dd["Population"]
sns.distplot(dd["ATM_per_hab"]);
The distribution we observe here is very close to a Gaussian.
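The visual impression can be backed up with a quick moment check: a roughly Gaussian sample should have skewness close to 0. This sketch uses a seeded synthetic sample standing in for `ATM_per_hab` (the real check would simply substitute `dd["ATM_per_hab"]`):

```python
import numpy as np

# synthetic stand-in for ATM_per_hab across the 62 counties (values are made up)
rng = np.random.default_rng(42)
sample = rng.normal(loc=5e-4, scale=1e-4, size=62)

# standardized third moment = sample skewness; near 0 for a symmetric bell shape
z = (sample - sample.mean()) / sample.std()
skewness = (z ** 3).mean()
print(round(skewness, 3))
```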
1/(dd["ATM_per_hab"].mean())
from sklearn.linear_model import LinearRegression

def test_lin_reg(data):
    sns.regplot(x="Population", y="ATM_count", data=data)
    X = np.array(data["Population"]).reshape(-1, 1)
    y = data["ATM_count"]
    reg = LinearRegression().fit(X, y)
    print("score=", reg.score(X, y))
    print("slope=", reg.coef_[0])
    print("one ATM per {} inhabitants".format(round(1 / reg.coef_[0])))
    # squared residuals: large values flag counties far from the fitted line
    data["std_err"] = (data["ATM_count"] - reg.predict(X)) ** 2
    print("outliers", data.sort_values("std_err").tail())
test_lin_reg(dd)
test_lin_reg(pd.DataFrame(dd.loc[dd["County"]!="Queens County"]))
Business rule of thumb:
counties in the low tail are where we should build new ATMs.
dd.nsmallest(5,"ATM_per_hab")
On the opposite side, these are the counties with 'too many ATMs' according to our population rule.
dd.nlargest(5,"ATM_per_hab")
So far we've found a very strong correlation between county population and number of ATMs.
Queens County and Putnam County need more ATMs, while Warren County and Essex County have too many.
To validate those findings, we would need to improve our data collection, use a geocoder on the ATMs without location data, and find additional ATM data sources.
Population is definitely a major feature, but we also need to take more factors into consideration before investing in new ATMs.
wealth = pd.read_html("https://en.wikipedia.org/wiki/List_of_New_York_locations_by_per_capita_income",
                      header=0,
                      attrs={"class": "wikitable sortable"})[0]
wealth.drop("Rank", axis=1, inplace=True)
wealth["County"] = wealth["County"].apply(lambda s: s if s.endswith("County") else s + " County")
# strip currency symbols literally (regex=False) before the numeric conversion
wealth["Medianhouseholdincome"] = pd.to_numeric(
    wealth["Medianhouseholdincome"].str.replace("$", "", regex=False).str.replace(",", "", regex=False),
    errors='raise')
ddd=dd.merge(wealth, left_on='County', right_on='County')
ddd.head()
ddd.nlargest(5,"Medianhouseholdincome")
ddd.nsmallest(5,"Medianhouseholdincome")
sns.scatterplot(x=ddd["Medianhouseholdincome"], y=ddd["ATM_per_hab"])
Somewhat disappointing: these look totally uncorrelated.
Median household income doesn't seem to explain why some counties have more or fewer ATMs than others.
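The visual impression can be quantified: on the real data, `ddd["Medianhouseholdincome"].corr(ddd["ATM_per_hab"])` gives the Pearson coefficient. Here is a self-contained sketch with made-up, independently drawn values, where the coefficient lands near 0 by construction:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for ddd: income and ATM density drawn independently,
# so there is no linear relation by construction
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Medianhouseholdincome": rng.uniform(30_000, 90_000, size=62),
    "ATM_per_hab": rng.uniform(1e-4, 1e-3, size=62),
})

r = demo["Medianhouseholdincome"].corr(demo["ATM_per_hab"])  # Pearson by default
print(round(r, 2))  # close to 0 for independent draws
```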
When it comes to choosing a location for a new ATM:
Limitations of this approach:
In the competitor analysis we will shift our focus from counties to institutions.
We will try to discover patterns in the ATM locations of each institution, and understand the strategy they use to choose their sites.
Maybe institution B's ATMs are always placed close to transportation services?
We will use the Foursquare API to gather data about the surroundings.
If we can understand the strategies our competitors are using, this will give us a great competitive advantage going forward.
To avoid going over quotas on the Foursquare API, we will work with only a sample of ATMs.
institution=df.groupby(["Name of Institution"])["County"].nunique()\
.sort_values(ascending=False).head(10).keys().to_list()
institution
df1 = df.loc[df["Name of Institution"].isin(institution)]\
    .groupby(["Name of Institution", "County"], group_keys=False)
t = df1.apply(lambda g: g.sample(1)).groupby("Name of Institution").apply(lambda g: g.sample(5))
t
We have selected 5 random ATMs for each institution, additionally chosen so that they are not all located inside the same county.
t["Name of Institution"].value_counts()
import matplotlib.cm as cm
import matplotlib.colors as colors
map_newyork = folium.Map(location=MAP_CENTER, zoom_start=6)
rainbow = cm.rainbow(np.linspace(0, 1, len(institution)))
rainbowm = dict(zip(institution, rainbow))
for lat, lng, name, addr in zip(t['lat'], t['lon'], t['Name of Institution'], t['County_idx']):
    if pd.notna(lat):
        label = folium.Popup('{}, {}'.format(name, addr), parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=colors.rgb2hex(rainbowm.get(name, [0, 0, 0])),
            fill=True,
            fill_color=colors.rgb2hex(rainbowm.get(name, [0, 0, 0])),
            fill_opacity=1).add_to(map_newyork)
map_newyork
import requests
import requests_cache
import time
requests_cache.install_cache('foursquare_cache', backend='sqlite', expire_after=60*60*24*14)
CLIENT_ID = 'VEOT5G4R5G5NCEH1EIHTGGWJ4PLQLZNINFK3A10IWD3DRH11' # your Foursquare ID
CLIENT_SECRET = '02QISYXURM1GZFWOTNV4ET4LWXEGS2F22EQHU1MUU50Q0I05' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
def explore_url(lat, lon, radius, limit):
    return 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        lon,
        radius,
        limit)
def request(url):
    now = time.ctime(int(time.time()))
    response = requests.get(url)
    print("Time: {0} / Used Cache: {1}".format(now, response.from_cache))
    return response.json()
def getNearbyVenues(atm_id, latitudes, longitudes, radius=500, limit=1000):
    venues_list = []
    for idx, lat, lng in zip(atm_id, latitudes, longitudes):
        url = explore_url(lat, lng, radius, limit)
        results = request(url)["response"]['groups'][0]['items']
        venues_list.append([(
            idx,
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['atm_idx', 'category']
    return nearby_venues
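Since live (non-cached) calls can still hit rate limits, a small retry wrapper is a cheap safeguard around `request`. This is a generic sketch, not part of the Foursquare client; the usage line at the bottom is hypothetical:

```python
import time

def with_retries(fn, attempts=3, backoff=1.0):
    """Call fn(); on exception, wait and retry with exponential backoff."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * (2 ** i))

# usage (hypothetical): results = with_retries(lambda: request(url))
```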
def cat_df(i):
    sample_atms = t.loc[i]
    r = getNearbyVenues(atm_id=sample_atms.index, latitudes=sample_atms['lat'], longitudes=sample_atms['lon'])
    tmp = r.groupby("category").count().reset_index()
    tmp["institution"] = i
    tmp.rename(columns={"atm_idx": "count"}, inplace=True)
    return tmp.sort_values("count", ascending=False)
all_categories=pd.concat([cat_df(i) for i in institution],ignore_index=True)
all_categories.sort_values("count",ascending=False)
For the 5 selected ATMs owned by JPMorgan Chase Bank, we found 13 pizza places in total in the surroundings.
We can try to visualize those categories using a pivot table
pd.pivot_table(all_categories,
               values='count',
               index=['category'],
               columns=['institution'],
               aggfunc=np.sum,
               fill_value=0)
The granularity of categories in the Foursquare response is way too fine; we don't need that level of detail.
Using the Foursquare categories endpoint we can build a mapping and reduce the dimensionality of the category feature.
categories = requests.get('https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'
                          .format(CLIENT_ID, CLIENT_SECRET, VERSION))
[i["name"] for i in categories.json()["response"]["categories"]]
The 10 top-level parent categories are a great fit for our use case.
We can use a recursive function to walk through the category tree and build a mapping that converts each category to its top-level parent.
category_mapping = {}

def load_children(parent, children):
    for child in children:
        if child["name"] in category_mapping:
            raise Exception('duplicate', child["name"])
        category_mapping[child["name"]] = parent
        if child["categories"]:
            load_children(parent, child["categories"])

for i in categories.json()["response"]["categories"]:
    load_children(i["name"], i["categories"])
category_mapping
all_categories["parent_category"]=all_categories["category"].apply(lambda c:category_mapping.get(c,"na"))
pivot = pd.pivot_table(all_categories,
                       values='count',
                       index=['parent_category'],
                       columns=['institution'],
                       aggfunc=np.sum,
                       fill_value=0)
pivot
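Raw counts are hard to compare across institutions whose samples return different numbers of venues; normalizing each institution's column to proportions makes the heatmap comparable. A sketch on a tiny hypothetical pivot (the bank names and counts are made up):

```python
import pandas as pd

# hypothetical pivot of venue-category counts per institution
pivot_demo = pd.DataFrame(
    {"Bank-A": [8, 2], "Bank-B": [3, 9]},
    index=["Food", "Travel & Transport"],
)

# divide each column by its total so every institution's shares sum to 1
shares = pivot_demo.div(pivot_demo.sum(axis=0), axis=1)
print(shares)
```

The same `pivot.div(pivot.sum(axis=0), axis=1)` applied to the real `pivot` would feed directly into the heatmap below.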
from matplotlib import pyplot
a4_dims = (11.7, 8.27)
fig, ax = pyplot.subplots(figsize=a4_dims)
sns.heatmap(ax=ax, data=pivot, cmap="YlGnBu")